Appendix
#Load packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pacman::p_load(caret)
pacman::p_load(modelr)
#Parse csv, save to variable
file_path <- "/Users/mudaphilly/code/UoA/Data Taming/Assignment 3 - Case study/melbourne.csv"
melb_raw <- read.csv(file_path)
melb_raw <- as_tibble(melb_raw)
# Rename columns
melb_raw <- melb_raw %>%
rename(min_temp = Minimum.temperature..Deg.C.,
max_temp = Maximum.Temperature..Deg.C.,
rh_9am = X9am.relative.humidity....)
# Add seperate day, month, year with mutate() and regex
melb_raw <- melb_raw %>%
mutate(year = as.integer(str_extract(Date, "^\\d{4}")),
month = as.integer(str_extract(Date, "(?<=-)\\d{2}(?=-)")),
day = as.integer(str_extract(Date, "(?<=-)\\d{1,2}$")),
weekday = wday(as.Date(Date, format = "%Y-%m-%d"), label = TRUE)) %>%
select(Date, year, month, day, weekday, everything())
# More Tidying. Select bivariates, Save to new dataframe
melb_tidy <- melb_raw %>%
select(Date, month, weekday, Evaporation..mm., min_temp, max_temp, rh_9am)
# Temps to float, months to strings, Let R know month and weekday are categorical variables
melb_tidy <- melb_tidy %>%
mutate(evaporation = as.numeric(Evaporation..mm.),
min_temp = as.numeric(min_temp),
max_temp = as.numeric(max_temp)) %>%
mutate(month = month.name[month]) %>%
mutate(month = factor(month)) %>%
mutate(weekday = factor(weekday))
# Change evap and rainfall variable names
melb_tidy <- melb_tidy %>%
select(-Evaporation..mm.)
# Remove rows with NAs from data frame
melb_tidy <- na.omit(melb_tidy)
# Inspect
head(melb_tidy)
## # A tibble: 6 × 7
## Date month weekday min_temp max_temp rh_9am evaporation
## <chr> <fct> <ord> <dbl> <dbl> <int> <dbl>
## 1 2019-01-1 January Tue 15.5 26.2 74 7
## 2 2019-01-2 January Wed 18.4 22.2 64 7
## 3 2019-01-3 January Thu 15.9 29.5 75 6.6
## 4 2019-01-4 January Fri 18 42.6 31 7.8
## 5 2019-01-5 January Sat 17.4 21.2 63 15.4
## 6 2019-01-6 January Sun 14.6 22.1 55 6.4
Assessment of Distribution: The data has a symmetric shape, with evaporation rates highest between November to February and at their lowest ebb between May to August (Figure 1). The spread of the data matches this trend, with a general greater spread November to February and reduced spread May through August. The number of outliers matches the trend for shape and spread. More are observed November to February, with fewer to zero May through August.
The monthly median can be seen by the red dots (Figure 1). Evaporation median decreases from January to June, before increasing again through December. March and November, the months at the end and beginning of Summer, are the exception to this rule in this dataset. They can be seen to have very marginal increases and decreases respectively (Figure 1).
Assessment of Significance: The p-value is smaller than 0.05, so we reject the null hypothesis. Adjusted R-squared is 31%. Linearity and homoscedasticity tests are moderately good (evap_l1, evap_l2). Normality trends up towards positive 2 but is otherwise quite good (evap_l3).
#Evaporation and Month
melb_tidy <- melb_tidy %>%
mutate(month = factor(month, levels = month.name))
evap_plot <- ggplot(melb_tidy, aes(x = month, y = evaporation)) +
geom_boxplot() +
stat_summary(fun=median, geom="line", aes(group=1), color = "blue") +
stat_summary(fun=median, geom="point", color = "red") +
labs(title = "Figure 1. Evaporation (mm) by Month",
x = "Month",
y = "Evaporation (mm)") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
evap_plot
# Inspect relationship (can month be used to predict evaporation)
month.lm <- lm(evaporation ~ month, data = melb_tidy)
summary(month.lm)
##
## Call:
## lm(formula = evaporation ~ month, data = melb_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3643 -1.6433 -0.5643 0.9517 12.6194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.6387 0.5172 16.704 < 2e-16 ***
## monthFebruary -1.2744 0.7507 -1.698 0.090482 .
## monthMarch -1.2581 0.7314 -1.720 0.086306 .
## monthApril -3.3954 0.7374 -4.604 5.83e-06 ***
## monthMay -5.2387 0.7314 -7.163 4.81e-12 ***
## monthJune -6.6987 0.7374 -9.084 < 2e-16 ***
## monthJuly -5.9587 0.7740 -7.698 1.47e-13 ***
## monthAugust -5.2484 0.7314 -7.176 4.42e-12 ***
## monthSeptember -4.3520 0.7374 -5.901 8.61e-09 ***
## monthOctober -2.7355 0.7314 -3.740 0.000215 ***
## monthNovember -2.2920 0.7374 -3.108 0.002040 **
## monthDecember -1.7904 0.7439 -2.407 0.016614 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.879 on 345 degrees of freedom
## Multiple R-squared: 0.3408, Adjusted R-squared: 0.3198
## F-statistic: 16.21 on 11 and 345 DF, p-value: < 2.2e-16
# Check linear relationship assumptions
evap_l1 <- plot(month.lm, which = 1)
evap_l2 <- plot(month.lm, which = 2)
evap_l3 <- plot(month.lm, which = 3)
# Bivariate analysis - Evaporation vs Week Days As all individual
weekdays are above 0.05. The anova p-value is 0.6193. Both results
confirm there is no statistically significant relationship between
weekdays and evaporation rates. The negative r-value (-0.004441) further
confirms there is no relationship.
There is no linear relationship evident in Figure 1.2.
With no statistically significant relationship, weekday can be dropped as a predictor, and no further investigation conducted.
# Adjust weekday factor levels
melb_tidy %>%
mutate(weekday = factor(weekday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
## # A tibble: 357 × 7
## Date month weekday min_temp max_temp rh_9am evaporation
## <chr> <fct> <ord> <dbl> <dbl> <int> <dbl>
## 1 2019-01-1 January Tue 15.5 26.2 74 7
## 2 2019-01-2 January Wed 18.4 22.2 64 7
## 3 2019-01-3 January Thu 15.9 29.5 75 6.6
## 4 2019-01-4 January Fri 18 42.6 31 7.8
## 5 2019-01-5 January Sat 17.4 21.2 63 15.4
## 6 2019-01-6 January Sun 14.6 22.1 55 6.4
## 7 2019-01-7 January Mon 17.1 23.1 55 9
## 8 2019-01-8 January Tue 16.7 24.1 72 7.2
## 9 2019-01-9 January Wed 16.1 20.5 62 7.4
## 10 2019-01-10 January Thu 13.5 21.4 53 8.2
## # ℹ 347 more rows
# Check factors
print(levels(melb_tidy$weekday))
## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"
# Visualise relationship
weekday_plot <- ggplot(melb_tidy, aes(x = weekday, y = evaporation)) +
geom_boxplot() +
stat_summary(fun=median, geom="line", aes(group=1), color = "blue") +
stat_summary(fun=median, geom="point", color = "red") +
labs(title = "Figure 1.2. Effects of Weekday on Evaporation (mm)",
x = "Weekday",
y = "Evaporation (mm)") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
weekday_plot
# Inspect relationship (can weekday be used to predict evaporation)
weekday.lm <- lm(evaporation ~ weekday, data = melb_tidy)
summary(weekday.lm)
##
## Call:
## lm(formula = evaporation ~ weekday, data = melb_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3380 -2.6431 -0.7373 1.6846 14.6620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3116 0.1852 28.679 <2e-16 ***
## weekday.L 0.4365 0.4886 0.893 0.372
## weekday.Q 0.1724 0.4891 0.352 0.725
## weekday.C 0.7641 0.4900 1.559 0.120
## weekday^4 0.1164 0.4898 0.238 0.812
## weekday^5 -0.4442 0.4914 -0.904 0.367
## weekday^6 -0.2287 0.4912 -0.466 0.642
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.499 on 350 degrees of freedom
## Multiple R-squared: 0.01249, Adjusted R-squared: -0.004441
## F-statistic: 0.7377 on 6 and 350 DF, p-value: 0.6196
anova(weekday.lm)
## Analysis of Variance Table
##
## Response: evaporation
## Df Sum Sq Mean Sq F value Pr(>F)
## weekday 6 54.2 9.0314 0.7377 0.6196
## Residuals 350 4285.1 12.2432
Assessment of Bi-variate Distribution: There is a strong positive linear trend from left to right evident, with the majority of max_temp results observed below 30 Degrees C (Figure 1.3). The median is 19.70 and mean is 20.87. The lower interquartile range is 16.10 and upper is 23.90, giving an IQR of 7.8. There are several outliers in the higher end of the range which are located above 35 Degrees C (Figure 1.3).
Assement of Linear Regression Assumptions: When linear regression assumptions are tested, all 3 measures fail (maxt_l1, maxt_l2, maxt_l1). The distribution of max_temp is left skewed (maxt_distribution). When the max_temp variable is logged, the skewdness is resolved (max_temp_log_dist). Improvements are also noted in normality(maxt_log_l1) and homoscadasticity (maxt_log_l3 )
Assessment of Significance: With the p-value below 0.05, there is a statistically strong relationship between max_temp and evaporation. The r-squared value suggests the max_temp has a predictor value of 32%. The Pearson correlation is 0.578.
Summary: Figure 1.3 and the significant p-value show a strong relationship between maximum temperature and evaporation. As temperatures increase, so does the evaporation rate.The Pearson correlation, r-squared value and linear assumption tests do, however, demonstrate that the linear relationship contains some variation.
There are data distribution issues with the variable, notably it fails linear regression assumptions (Figure 1.4). For this reason, the max_temp variable was log transformed (Figure 1.5). For all further analysis, the max_temp log has been log transformed (Figure 1.6).
# Visualise relationship
maxt_plot <- ggplot(melb_tidy, aes(x = max_temp, y = evaporation)) +
geom_point() +
geom_smooth(method = "lm", se =TRUE, color = "blue") +
labs(title = "Figure 1.3. Effects of Max Temp (C) on Evaporation (mm)",
x = "Max Temp (Degrees C)",
y = "Evaporation (mm)")
maxt_plot
## `geom_smooth()` using formula = 'y ~ x'
# Inspect spread
summary(melb_tidy$max_temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.60 16.20 20.00 20.87 23.90 42.80
sd(melb_tidy$max_temp)
## [1] 6.224088
# Inspect relationship (can max_temp be used to predict evaporation)
cor(melb_tidy$evaporation, melb_tidy$max_temp, method="pearson")
## [1] 0.5785657
sd(melb_tidy$max_temp)
## [1] 6.224088
max_temp.lm <- lm(evaporation ~ max_temp, data = melb_tidy)
summary(max_temp.lm)
##
## Call:
## lm(formula = evaporation ~ max_temp, data = melb_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8006 -2.0065 -0.3862 1.2901 9.9806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.46074 0.52875 -2.763 0.00603 **
## max_temp 0.32454 0.02428 13.365 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.852 on 355 degrees of freedom
## Multiple R-squared: 0.3347, Adjusted R-squared: 0.3329
## F-statistic: 178.6 on 1 and 355 DF, p-value: < 2.2e-16
# Check the linear regression assumptions
maxt_l1 <- plot(max_temp.lm, which = 1)
cat("Therer is a pattern and residual line is not zero. Linearity failed")
## Therer is a pattern and residual line is not zero. Linearity failed
maxt_l2 <- plot(max_temp.lm, which = 2)
cat("There are upwards trends at the extremes. Normality failed")
## There are upwards trends at the extremes. Normality failed
maxt_l3 <- plot(max_temp.lm, which = 3)
cat("The line is linear, not flat. homoscedasticity failed")
## The line is linear, not flat. homoscedasticity failed
# Assumptions failed. Inspect the data distribution (skewdness)
maxt_distribution <- ggplot(melb_tidy, aes(x = max_temp)) +
geom_histogram(col = "black", fill = "blue", bins = 30) +
ggtitle("Figure 1.4: Distribution of Maximum Temperature (°C)")
maxt_distribution
# Investigate effects of log on data distribution (skwedness)
# Log transform max_temp
melb_tidy <- melb_tidy %>% mutate(max_temp_log = log(max_temp))
# Inspect distribution
max_temp_log_dist <- ggplot(melb_tidy, aes(max_temp_log)) + geom_histogram(col = "black", fill = "blue") + ggtitle(label = "Figure 1.5 Max_Temp - Log Transformed Distribution")
max_temp_log_dist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Re-plot Max Temp with log vs Evaporation
maxt_log_plot <- ggplot(melb_tidy, aes(x = log(max_temp), y = evaporation)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Figure 1.6. Max_Temp(log) (Degrees C) vs. Evaporation (mm)",
x = "log(max_temp)",
y = "Evaporation")
maxt_log_plot
## `geom_smooth()` using formula = 'y ~ x'
# Re-inspect relationship with log
max_temp_log.lm <- lm(evaporation ~ log(max_temp), data = melb_tidy)
summary(max_temp_log.lm)
##
## Call:
## lm(formula = evaporation ~ log(max_temp), data = melb_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1495 -1.8934 -0.4657 1.2836 10.1984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.595 1.599 -9.756 <2e-16 ***
## log(max_temp) 6.977 0.531 13.138 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.868 on 355 degrees of freedom
## Multiple R-squared: 0.3272, Adjusted R-squared: 0.3253
## F-statistic: 172.6 on 1 and 355 DF, p-value: < 2.2e-16
# Very minimal improvements. R-value worse. Inspect assumptions
maxt_log_l1 <- plot(max_temp_log.lm, which = 1)
cat("Much better, however still drops off 0 at upper end")
## Much better, however still drops off 0 at upper end
maxt_log_l2 <- plot(max_temp_log.lm, which = 2)
cat("Much better, however still a trend at upped end")
## Much better, however still a trend at upped end
maxt_log_l3 <- plot(max_temp_log.lm, which = 3)
cat("There is a linear line here. Fails the homoscedasticity test")
## There is a linear line here. Fails the homoscedasticity test
Assessment of Distribution: There is a strong positive linear trend from left to right evident, with the majority of results falling between 5 and 22 degrees C (Figure 1.7). There are two outliers clearly evident, both occurring above 25 Degrees C (Figure 1.8)
The median is 11.4 and mean is 11.83. The lower quartile range is 8.6 and upper is 14.8, giving an IQR of 6.2. This is a high value in context to a mean of 11.83, suggesting there is high spread in the data.
Assessment of Significance: With the p-value below 0.05, there is a statistical strong relationship between max_temp and evaporation. The r-squared value suggests the max_temp has a predictor value of 42%.The Pearson correlation is 0.6557644, which is moderate to strong.
Summary: Figure 1.7, the Pearson correlation and the significant p-value display a strong relationship between minimum temperature and evaporation. As temperatures decrease, so does the evaporation rate. Unlike max_temp, the linear assumptions are close to passing. The r-squared value does suggest there is, however, some variation in the relationship.
# Visualise relationship
mint_plot <- ggplot(melb_tidy, aes(x = min_temp, y = evaporation)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Figure 1.7. Effects of Min Temp (Degrees C) on Evaporation (mm)",
x = "Min Temp (Degrees C)",
y = "Evaporation (mm)")
mint_plot
## `geom_smooth()` using formula = 'y ~ x'
#Inspect the spread
summary(melb_tidy$min_temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.80 8.60 11.40 11.83 14.80 25.10
sd(melb_tidy$min_temp)
## [1] 4.518612
# Inspect relationship (can min temp be used to predict evaporation)
cor(melb_tidy$evaporation, melb_tidy$min_temp, method="pearson")
## [1] 0.6550878
min_temp.lm <- lm(evaporation ~ min_temp, data = melb_tidy)
# Check the linear regression assumptions
plot(min_temp.lm, which = 1)
cat("The line is quite Linear")
## The line is quite Linear
plot(min_temp.lm, which = 2)
cat("The line displays evidence of normality. Slight upward trend at the upper extreme")
## The line displays evidence of normality. Slight upward trend at the upper extreme
plot(min_temp.lm, which = 3)
cat("The line shows some trend. Homoscedasticity failed")
## The line shows some trend. Homoscedasticity failed
# Check the distribution
min_temp_dist <- ggplot(melb_tidy, aes(min_temp)) + geom_histogram(col = "black", fill = "blue") + ggtitle(label = "Figure 1.8 Minimum Temperature Distribution")
min_temp_dist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Bivariate analysis - Humidity at 9am
Assessment of Distribution: There is a strong linear trend from left to right, with most data falling between 40 to 100% humidity (Figure 1.9). There are several outliers on the lower end of the data, all falling below the 40% humidity level (Figure 1.10). The median is 68 and mean is 88.2. The lower interquartile range is 61 and upper is 77, resulting in an IQR of 16. In relation to the mean, this is quite low, which suggests low variation in the data.
Assessment of Significance: The p-value is very low and well below 0.05, suggesting a statistically significant relationship between humidity percentage and evaporation rates. It should, however, be noted that the R-squared value of 27% and the Pearson correlation (-0.525713) suggest the linear relationship is not perfect.
Summary: Figure 1.9 and the significant p-value display a strong relationship between relative humidity and evaporation. As humidity decreases, the evaporation rate increases. The assessment of linear assumptions and the R-squared value do, however, suggest there is not a perfect linear relationship.
# Visualise relationship
humid_plot <- ggplot(melb_tidy, aes(x = rh_9am, y = evaporation)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Figure 1.9. Effects of Humidity(%) on Evaporation (mm)",
x = "Relative Humidity (%)",
y = "Evaporation (mm)")
humid_plot
## `geom_smooth()` using formula = 'y ~ x'
# Inspect spread
summary(melb_tidy$rh_9am)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 61.00 68.00 68.36 77.00 100.00
sd(melb_tidy$rh_9am)
## [1] 13.58471
# Inspect relationship (can humidity be used to predict evaporation)
cor(melb_tidy$evaporation, melb_tidy$rh_9am, method="pearson")
## [1] -0.5244449
rh_9am.lm <- lm(evaporation ~ rh_9am, data = melb_tidy)
summary(rh_9am.lm)
##
## Call:
## lm(formula = evaporation ~ rh_9am, data = melb_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6296 -2.0165 -0.5904 1.2617 13.1531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.52522 0.80936 17.95 <2e-16 ***
## rh_9am -0.13478 0.01161 -11.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.977 on 355 degrees of freedom
## Multiple R-squared: 0.275, Adjusted R-squared: 0.273
## F-statistic: 134.7 on 1 and 355 DF, p-value: < 2.2e-16
# Check the linear regression assumptions
plot(rh_9am.lm, which = 1)
cat("The line is quite Linear")
## The line is quite Linear
plot(rh_9am.lm, which = 2)
cat("The line displays evidence of normality. There is a slight upward trend at the upper extreme")
## The line displays evidence of normality. There is a slight upward trend at the upper extreme
plot(rh_9am.lm, which = 3)
cat("The line shows evidence of Homoscedasticity")
## The line shows evidence of Homoscedasticity
# Check the distribution
rh_9am_dist <- ggplot(melb_tidy, aes(rh_9am)) + geom_histogram(col = "black", fill = "blue") + ggtitle(label = "Figure 1.10. Relative Humidity Distribution")
rh_9am_dist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The first linear regression equation will be-
Evaporation = β0 + β1 × min_temp + β2 ×max_temp+ β3 ×rh_9am + β4 ×month + β5 × weekday+ϵ Where:
β0 = Intercept β1 = min_temp β2 = log(max_temp) β3 = rh_9am β4 = month β5 = humidity:month interaction β6 = weekday
# Fit model with all predictors in the bivariate analysis description
evap.lm <- lm(evaporation ~ min_temp + max_temp_log + rh_9am + month + rh_9am:month + weekday, data = melb_tidy)
# Get Quantitative p-values
summary(evap.lm)
##
## Call:
## lm(formula = evaporation ~ min_temp + max_temp_log + rh_9am +
## month + rh_9am:month + weekday, data = melb_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5713 -1.1753 -0.0614 1.0656 11.0593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.032956 3.090318 2.599 0.00976 **
## min_temp 0.362405 0.044451 8.153 7.8e-15 ***
## max_temp_log 0.193690 0.696177 0.278 0.78102
## rh_9am -0.099191 0.032523 -3.050 0.00248 **
## monthFebruary 1.078303 3.341667 0.323 0.74714
## monthMarch 5.335159 2.631974 2.027 0.04347 *
## monthApril 1.709555 3.105525 0.550 0.58236
## monthMay -4.367959 3.341427 -1.307 0.19206
## monthJune -8.053180 3.969057 -2.029 0.04327 *
## monthJuly -5.088920 3.574263 -1.424 0.15547
## monthAugust -6.440114 3.214306 -2.004 0.04595 *
## monthSeptember -0.627289 3.157471 -0.199 0.84265
## monthOctober -6.324653 3.116914 -2.029 0.04326 *
## monthNovember -1.139358 2.786344 -0.409 0.68288
## monthDecember 0.719722 2.796874 0.257 0.79709
## weekday.L 0.145468 0.308687 0.471 0.63778
## weekday.Q 0.512647 0.312166 1.642 0.10151
## weekday.C 0.433767 0.312792 1.387 0.16647
## weekday^4 0.407605 0.313090 1.302 0.19388
## weekday^5 -0.133451 0.310427 -0.430 0.66756
## weekday^6 -0.100570 0.308842 -0.326 0.74491
## rh_9am:monthFebruary -0.025849 0.050992 -0.507 0.61255
## rh_9am:monthMarch -0.081121 0.039589 -2.049 0.04126 *
## rh_9am:monthApril -0.043431 0.047154 -0.921 0.35771
## rh_9am:monthMay 0.035690 0.047820 0.746 0.45600
## rh_9am:monthJune 0.079459 0.052872 1.503 0.13385
## rh_9am:monthJuly 0.050973 0.051328 0.993 0.32141
## rh_9am:monthAugust 0.080318 0.047449 1.693 0.09147 .
## rh_9am:monthSeptember -0.006416 0.049347 -0.130 0.89664
## rh_9am:monthOctober 0.092106 0.047505 1.939 0.05338 .
## rh_9am:monthNovember 0.015394 0.041743 0.369 0.71253
## rh_9am:monthDecember -0.020011 0.041403 -0.483 0.62919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.176 on 325 degrees of freedom
## Multiple R-squared: 0.6455, Adjusted R-squared: 0.6117
## F-statistic: 19.09 on 31 and 325 DF, p-value: < 2.2e-16
# Get Categorical p-values
anova(evap.lm)
## Analysis of Variance Table
##
## Response: evaporation
## Df Sum Sq Mean Sq F value Pr(>F)
## min_temp 1 1862.17 1862.17 393.4564 < 2.2e-16 ***
## max_temp_log 1 99.17 99.17 20.9536 6.708e-06 ***
## rh_9am 1 553.12 553.12 116.8691 < 2.2e-16 ***
## month 11 83.37 7.58 1.6013 0.0968739 .
## weekday 6 39.68 6.61 1.3973 0.2150497
## rh_9am:month 11 163.62 14.87 3.1429 0.0004698 ***
## Residuals 325 1538.17 4.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using the summary(evap.lm), p-values for quantitative variables are 5.38e-15 (min_temp), 0.850 (max_temp), and 0.002 (rh_9am). Both min_temp and rh_9am have statistically significant p-values.
Using the ANOVA function, p-values for the categorical variables are 0.087 (month), 0.217 (weekday) and 0.0003 (rh9am:month)
# Quantitative
summary(evap.lm)
##
## Call:
## lm(formula = evaporation ~ min_temp + max_temp_log + rh_9am +
## month + rh_9am:month + weekday, data = melb_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5713 -1.1753 -0.0614 1.0656 11.0593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.032956 3.090318 2.599 0.00976 **
## min_temp 0.362405 0.044451 8.153 7.8e-15 ***
## max_temp_log 0.193690 0.696177 0.278 0.78102
## rh_9am -0.099191 0.032523 -3.050 0.00248 **
## monthFebruary 1.078303 3.341667 0.323 0.74714
## monthMarch 5.335159 2.631974 2.027 0.04347 *
## monthApril 1.709555 3.105525 0.550 0.58236
## monthMay -4.367959 3.341427 -1.307 0.19206
## monthJune -8.053180 3.969057 -2.029 0.04327 *
## monthJuly -5.088920 3.574263 -1.424 0.15547
## monthAugust -6.440114 3.214306 -2.004 0.04595 *
## monthSeptember -0.627289 3.157471 -0.199 0.84265
## monthOctober -6.324653 3.116914 -2.029 0.04326 *
## monthNovember -1.139358 2.786344 -0.409 0.68288
## monthDecember 0.719722 2.796874 0.257 0.79709
## weekday.L 0.145468 0.308687 0.471 0.63778
## weekday.Q 0.512647 0.312166 1.642 0.10151
## weekday.C 0.433767 0.312792 1.387 0.16647
## weekday^4 0.407605 0.313090 1.302 0.19388
## weekday^5 -0.133451 0.310427 -0.430 0.66756
## weekday^6 -0.100570 0.308842 -0.326 0.74491
## rh_9am:monthFebruary -0.025849 0.050992 -0.507 0.61255
## rh_9am:monthMarch -0.081121 0.039589 -2.049 0.04126 *
## rh_9am:monthApril -0.043431 0.047154 -0.921 0.35771
## rh_9am:monthMay 0.035690 0.047820 0.746 0.45600
## rh_9am:monthJune 0.079459 0.052872 1.503 0.13385
## rh_9am:monthJuly 0.050973 0.051328 0.993 0.32141
## rh_9am:monthAugust 0.080318 0.047449 1.693 0.09147 .
## rh_9am:monthSeptember -0.006416 0.049347 -0.130 0.89664
## rh_9am:monthOctober 0.092106 0.047505 1.939 0.05338 .
## rh_9am:monthNovember 0.015394 0.041743 0.369 0.71253
## rh_9am:monthDecember -0.020011 0.041403 -0.483 0.62919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.176 on 325 degrees of freedom
## Multiple R-squared: 0.6455, Adjusted R-squared: 0.6117
## F-statistic: 19.09 on 31 and 325 DF, p-value: < 2.2e-16
# Categorical
anova(evap.lm)
## Analysis of Variance Table
##
## Response: evaporation
## Df Sum Sq Mean Sq F value Pr(>F)
## min_temp 1 1862.17 1862.17 393.4564 < 2.2e-16 ***
## max_temp_log 1 99.17 99.17 20.9536 6.708e-06 ***
## rh_9am 1 553.12 553.12 116.8691 < 2.2e-16 ***
## month 11 83.37 7.58 1.6013 0.0968739 .
## weekday 6 39.68 6.61 1.3973 0.2150497
## rh_9am:month 11 163.62 14.87 3.1429 0.0004698 ***
## Residuals 325 1538.17 4.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use results summaries to progressively remove statistically insignificant variables.
# Highest p-value: remove max_temp
evap_v2.lm <- lm(evaporation ~ min_temp + rh_9am + month + rh_9am:month + weekday, data = melb_tidy)
summary(evap_v2.lm)
##
## Call:
## lm(formula = evaporation ~ min_temp + rh_9am + month + rh_9am:month +
## weekday, data = melb_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.619 -1.194 -0.085 1.098 11.063
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.621920 2.248209 3.835 0.000151 ***
## min_temp 0.366245 0.042195 8.680 < 2e-16 ***
## rh_9am -0.099383 0.032470 -3.061 0.002391 **
## monthFebruary 1.070387 3.336814 0.321 0.748582
## monthMarch 5.349365 2.627753 2.036 0.042587 *
## monthApril 1.736304 3.099641 0.560 0.575753
## monthMay -4.410731 3.333162 -1.323 0.186667
## monthJune -8.038559 3.963090 -2.028 0.043337 *
## monthJuly -5.201366 3.546311 -1.467 0.143422
## monthAugust -6.473398 3.207531 -2.018 0.044390 *
## monthSeptember -0.610357 3.152414 -0.194 0.846597
## monthOctober -6.286771 3.109529 -2.022 0.044016 *
## monthNovember -1.139353 2.782399 -0.409 0.682452
## monthDecember 0.781062 2.784222 0.281 0.779248
## weekday.L 0.146934 0.308205 0.477 0.633866
## weekday.Q 0.516749 0.311377 1.660 0.097964 .
## weekday.C 0.427073 0.311424 1.371 0.171208
## weekday^4 0.398590 0.310968 1.282 0.200833
## weekday^5 -0.134956 0.309940 -0.435 0.663543
## weekday^6 -0.096727 0.308096 -0.314 0.753759
## rh_9am:monthFebruary -0.025880 0.050920 -0.508 0.611617
## rh_9am:monthMarch -0.081594 0.039496 -2.066 0.039631 *
## rh_9am:monthApril -0.044248 0.046996 -0.942 0.347135
## rh_9am:monthMay 0.035445 0.047744 0.742 0.458383
## rh_9am:monthJune 0.078315 0.052637 1.488 0.137765
## rh_9am:monthJuly 0.051360 0.051236 1.002 0.316883
## rh_9am:monthAugust 0.079649 0.047321 1.683 0.093302 .
## rh_9am:monthSeptember -0.007641 0.049081 -0.156 0.876378
## rh_9am:monthOctober 0.091053 0.047287 1.926 0.055028 .
## rh_9am:monthNovember 0.014932 0.041651 0.358 0.720201
## rh_9am:monthDecember -0.021128 0.041150 -0.513 0.607989
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.172 on 326 degrees of freedom
## Multiple R-squared: 0.6454, Adjusted R-squared: 0.6128
## F-statistic: 19.78 on 30 and 326 DF, p-value: < 2.2e-16
anova(evap_v2.lm)
## Analysis of Variance Table
##
## Response: evaporation
## Df Sum Sq Mean Sq F value Pr(>F)
## min_temp 1 1862.17 1862.17 394.5731 < 2.2e-16 ***
## rh_9am 1 643.14 643.14 136.2746 < 2.2e-16 ***
## month 11 92.50 8.41 1.7819 0.0560042 .
## weekday 6 39.48 6.58 1.3942 0.2162523
## rh_9am:month 11 163.47 14.86 3.1488 0.0004588 ***
## Residuals 326 1538.54 4.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Remove next highest p-value: weekday
evap_v3.lm <- lm(evaporation ~ min_temp + rh_9am + month + rh_9am:month, data = melb_tidy)
anova(evap_v2.lm, evap_v3.lm)
## Analysis of Variance Table
##
## Model 1: evaporation ~ min_temp + rh_9am + month + rh_9am:month + weekday
## Model 2: evaporation ~ min_temp + rh_9am + month + rh_9am:month
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 326 1538.5
## 2 332 1570.7 -6 -32.204 1.1373 0.3403
# Remove the interaction term
evap_v4.lm <- lm(evaporation ~ min_temp + rh_9am + month , data = melb_tidy)
anova(evap_v2.lm, evap_v3.lm, evap_v4.lm)
## Analysis of Variance Table
##
## Model 1: evaporation ~ min_temp + rh_9am + month + rh_9am:month + weekday
## Model 2: evaporation ~ min_temp + rh_9am + month + rh_9am:month
## Model 3: evaporation ~ min_temp + rh_9am + month
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 326 1538.5
## 2 332 1570.7 -6 -32.204 1.1373 0.3402699
## 3 343 1741.5 -11 -170.742 3.2889 0.0002698 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using the results from lm model summaries and the ANOVA function, max_temp_log (0.850) and weekday (0.217) were excluded from the model due to statistically insignificant p-values exceeding 0.05. Despite the month variable having a p-value above 0.05, it was retained in the model due to the presence of multiple individual significant values within.
Interestingly, the exclusion of the interaction term rh_9am:month resulted in a statistically significant improvement in the model, with pr values falling from 0.334 in model 3 to 0.0002 in model 4 (model_comparison). This suggests model improvement in capturing variability.
When compared to the bivariate analysis, min_temp and rh_9am were as predicted (statistically significant), while weekday was deemed statistically insignificant. The exclusion of max_temp was interesting. While its p-value was significant in the simple bivariate model (max_temp.lm) it was not when a multi-variate analysis was undertaken (evap_v4.lm). This suggests there are correlations being introduced with the addition of multiple predictor variables. The most likely candidate for this change would be min_temp. It is a similar form of measurement with a much higher statistical significance in this data set.
evap_v4.lm <- lm(evaporation ~ min_temp + rh_9am + month , data = melb_tidy)
model_comparison <- anova(evap.lm, evap_v2.lm, evap_v3.lm, evap_v4.lm)
model_comparison
## Analysis of Variance Table
##
## Model 1: evaporation ~ min_temp + max_temp_log + rh_9am + month + rh_9am:month +
## weekday
## Model 2: evaporation ~ min_temp + rh_9am + month + rh_9am:month + weekday
## Model 3: evaporation ~ min_temp + rh_9am + month + rh_9am:month
## Model 4: evaporation ~ min_temp + rh_9am + month
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 325 1538.2
## 2 326 1538.5 -1 -0.366 0.0774 0.7810204
## 3 332 1570.7 -6 -32.204 1.1341 0.3421197
## 4 343 1741.5 -11 -170.742 3.2796 0.0002799 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The final model is as follows:
Evaporation rate = β0 + β1 × min_temp + β2 × rh_9am + β3 × month + ϵ
With, β0 being the intercept (8.363) β1 being coefficient for min_temp (0.368) β2 being coefficient for rh_9am (-0.096) β3 being the coefficients for each month ϵ being variability within model
Our intercept value is the expected evaporation rate (in millimeters) when all of our predictors are set to zero. So, zero degrees Celsius and 0% relative humidity at 9 am. For the month, we take an average month; a quick look at Figure 1 suggests something like October would be a good candidate!
Next, we have our predictor coefficients. These values indicate how a one-unit change in each predictor influences our evaporation rate (the assumption made here is that the others remain constant). In our model, these 1-unit values are 0.372 for minimum temperature and -0.099 for relative humidity. For our categorical predictors, we select those on the month of interest. For example, a month in summer (February is -0.593) or winter (June is -1.720).
Finally, we need to consider our p-value and model assumptions. We can see minimum temperature and relative humidity are statistically very significant whenever the model is run. However, when looking at the month we have a different picture. Here, only the months of May, June, and July have the same strong statistical relevance. So when running the model, we should acknowledge data from these months will have a greater influence on our evaporation prediction.
# Check there are no NAs in dataset
any(is.na(melb_tidy))
## [1] FALSE
# Double-check
colSums(is.na(melb_tidy))
## Date month weekday min_temp max_temp rh_9am
## 0 0 0 0 0 0
## evaporation max_temp_log
## 0 0
# Inspect Normality
evap_residuals = evap_v4.lm$residuals
residuals <- hist(evap_residuals)
# Plot the residuals
qqnorm <- qqnorm(evap_residuals)
# Plot the Q-Q line
qqline <- qqline(evap_residuals)
The histogram of model residuals shows a symmetrical distribution
centered around zero. This pattern is consistent with a normal
distribution. The QQ plot follows the line from -2 to 1.5 before skewing
upwards. This pattern suggests a generally normal but not perfect
distribution. The upward deviation likely indicates the presence of
outliers in the dataset.
# Inspect Linearity and Homoscedasticity
res <- residuals(evap_v4.lm)
fit <- fitted(evap_v4.lm)
# Residuals vs fitted
res_fit <- plot(fit, res, main = "Figure 1.11. Residuals vs Fitted", xlab = "Fitted Values", ylab = "Residuals")
lines(lowess(fit, res), col = "red")
# Inspect Homoscedasticity
hsced <- plot(evap_v4.lm, which = 3)
The model is not a perfect linear fit; this can be observed at the lower
end (< 0) and upper end (> 8) in the fitted values (Figure 1.11).
In the Scale-Location graph, residuals are again clustered between 0 and
8, indicating general homoscedasticity. Above the fitted value of 8, the
trendline skews up; this indicates increased variability in the upper
end of the dataset.
# Evaluate confidence interval
confint <- confint(evap_v4.lm, conf.level=0.95)
confint
## 2.5 % 97.5 %
## (Intercept) 6.1784984 10.61049437
## min_temp 0.2839277 0.44984344
## rh_9am -0.1150561 -0.07730609
## monthFebruary -1.7590559 0.56836244
## monthMarch -1.2666722 1.01466261
## monthApril -2.4054113 0.03742952
## monthMay -3.1080632 -0.59343645
## monthJune -3.0955206 -0.35905865
## monthJuly -2.9887651 -0.09367104
## monthAugust -2.4071319 0.40461479
## monthSeptember -2.4231238 0.33725518
## monthOctober -1.6549397 0.82412964
## monthNovember -1.3573925 1.05328325
## monthDecember -1.8077151 0.50299792
#1- Predict February 29, 2020. min_temp 13.8, max_temp 23.2 degrees, rh_9am 74%.
predicted_290220 <- predict(
evap_v4.lm,
newdata = data.frame(
min_temp = 13.8,
rh_9am = 74,
month = factor("February", levels = levels(melb_tidy$month))
),
interval = "confidence"
)
predicted_290220
## fit lwr upr
## 1 5.74477 4.880146 6.609395
#2- Predict December 25, 2020, min_temp 16.4, max_temp 31.9, rh_9am 57%
predicted_251220 <- predict(
evap_v4.lm,
newdata = data.frame(
min_temp = 16.4,
rh_9am = 57,
month = factor("December", levels = levels(melb_tidy$month))
),
interval = "confidence"
)
predicted_251220
## fit lwr upr
## 1 8.27674 7.421143 9.132336
#3- Predict January 13, 2020, min_temp 26.5, ma_temp 44.3, rh_9am 35%
predicted_130120 <- predict(
evap_v4.lm,
newdata = data.frame(
min_temp = 26.5,
rh_9am = 35,
month = factor("January", levels = levels(melb_tidy$month))
),
interval = "confidence"
)
predicted_130120
## fit lwr upr
## 1 14.75063 13.59748 15.90377
#4- Predict July 6, 2020, min_temp 6.8, max_temp 10.6, rh_9am 76%
predicted_060720 <- predict(
evap_v4.lm,
newdata = data.frame(
min_temp = 6.8,
rh_9am = 76,
month = factor("July", levels = levels(melb_tidy$month))
),
interval = "confidence"
)
predicted_060720
## fit lwr upr
## 1 2.038338 1.147526 2.92915
# Comparison Table
results <- rbind(
predicted_060720,
predicted_130120,
predicted_251220,
predicted_290220
)
# Convert to dataframe
results <- as.data.frame(results)
# Create coloumns
results <- data.frame(
fit = results$fit,
lwr = results$lwr,
upr = results$upr,
Prediction = c("06-07-2020", "13-01-2020", "25-12-2020", "29-02-2020")
)
# Re-order table columns
results <- results %>%
select(Prediction, fit, upr, lwr)
colnames(results) <- c("Date", "Evaporation_mm", "Upper_95%", "Lower_95%")
# Create table, save to variable
mwc_results <- knitr::kable(results, caption = "<h3>Table 1.1- Predicted Evaporation Rates (mm)", format = "html")
Looking at the predicted values (Table 1.1), evaporation rates range from 14.75mm to 2.03 mm. The highest values are from January, in the middle of summer and are lowest in July, during winter. It is worth noting there is a higher variability in the January prediction (4.878036) compared to the other predictions (which range from 1.71 to 1.78).
The time of year, and therefore weather conditions present play a significant role in predicting evaporation rates.
The Predictions above were run in the model using 95% confidence levels.
We can therefore predict-
Executive Summary
Melbourne Water Corporation (MWC) manages water supply to the city of Melbourne, Australia. In light of shifting weather patterns attributed to global climatic changes, previous estimations of evaporation rates used by MWC have proven unreliable under emerging conditions.
Recognizing the need for accurate predictions to secure Melbourne’s future water supply stability, MWC commissioned PAC Data Analytics to develop updated models (this report).
This study utilises data supplied over a 2 year period from MWC to identify key environmental conditions influencing daily evaporation rates in Melbourne. The identified conditions (predictors) being Minimum temperature, Relative Humidity and Month of the year. Using these and linear regression modelling techniques, a predictive model was built to estimate daily evaporation rates for any given day in Melbourne.
While acknowledging that a two-year dataset offers a limited snapshot of long-term climatic trends, this study underscores the ability to accurately predict evaporation rates from a small subset of environmental factors. Although the overall number of factors can be streamlined, expanding the data collection for these key conditions (Minimum Temperature, Relative Humidity, and Month) is recommended for a more robust analysis.
Methods
Data was systematically organised using principles of data taming, modelling and visualisation. This involved pre-processing, analysis of variables at the bi-variate and multi-variate levels, creation of a linear model and finally production of prediction values.
Materials-
All data analysis, manipulation and modelling was undertaken using the following software: - Microsoft Excel 2019 - R Studio (Version 2023.09.1+494). The following packages were utilised for data analysis * tidyverse * caret * modelr - Data used in the model was sourced from `melbourne.csv´, as provided by MWC
Procedure-
Data pre-processing- The pre-processing (tidy and taming) of the raw data from the melbounre.csv excel file followed the following steps 1) New dataframe was created with all environmental factors outside scope of this study removed 2) Data values were converted to the appropriate data type 3) Data with missing values were ommited from the dataset 4) Additional columns for day, month and year were added.
The final dataset was as follows
head(melb_tidy)
## # A tibble: 6 × 8
## Date month weekday min_temp max_temp rh_9am evaporation max_temp_log
## <chr> <fct> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 2019-01-1 January Tue 15.5 26.2 74 7 3.27
## 2 2019-01-2 January Wed 18.4 22.2 64 7 3.10
## 3 2019-01-3 January Thu 15.9 29.5 75 6.6 3.38
## 4 2019-01-4 January Fri 18 42.6 31 7.8 3.75
## 5 2019-01-5 January Sat 17.4 21.2 63 15.4 3.05
## 6 2019-01-6 January Sun 14.6 22.1 55 6.4 3.10
Bivariate Analysis- The response variable for the purpose of the study was the variable `evaporation´. A list of six predictor variables was supplied by MWC. 2 of these were categorical (month, weekday) and the remaining three quantitative (min_temp, max_temp, rh_9am). The Bivariate analysis is detailed in full in Appendix 2.2.
Each predictor was then examined against the response variable in the following order: 1) The relationship was plotted to visually inspect trends, if any 2) The data spread was examined 3) The response-predictor statistical relationship was examined (can it be used to predict evaporation) 4) The linear regression assumptions were inspected (normality, linearity and homoscedasticity).
At this stage, if a statistically significant p-value was present and the linear assumptions satisfied, the predictor was assumed appropriate at the bi-variate level. If there was no statistically significant p-value, the predictor was not adopted. If the p-value was significant but the linear regression assumptions not satisfied, the following investigation was undertaken:
The 5 predictors included in the bi-variate analysis plus an additional interaction term (upon request of the client), were investigated to determine statistical significance in relation to determining daily evaporation rates. These were:
The Multi-variate analysis was then undertaken in the following steps:
1a). A linear regression model was fit containing all possible predictors. The initial model expression was a follows-
Evaporation = β0 + β1 × min_temp + β2 ×max_temp+ β3 × rh_9am + β4 ×month + β5 × weekday +ϵ
Where, β0 = Intercept β1 = min_temp β2 = log(max_temp) β3 = rh_9am β4 = month β5 = humidity:month interaction β6 = weekday
1b) This was then transformed into the following code in R-studio (Appendix 2.3a)
evap.lm <- lm(evaporation ~ min_temp + max_temp_log + rh_9am + month + rh_9am:month + weekday, data = melb_tidy)
1c) Using this linear model, p-values were obtained using summary function quantitative values and the anova() function for categorical predictors (Appendix 2.3b)
Using these p-values, the predictor with the highest highest p-value was then removed and the model re-run (until all remaining predictors were significant at the 5% level). (Appendix 2.3c)
Steps 2 was repeated until only significant predictors remained.
The final model and significant predictors are-
Evaporation rate = β0 + β1 × min_temp + β2 × rh_9am + β3 × month + ϵ
With, β0 being the intercept β1 being coefficient for min_temp β2 being coefficient for rh_9am β3 being a coefficients for month(s) ϵ being variability within model
Further final model discussion and the bi-variate/multi-variate analysis is provided in Appendix(2.3d)
Model diagnostics-
To determine model accuracy, the following diagnostics were undertaken on the final model
A full description can be found in Appendix 2.3(f)
Predictions-
The model predictions are set to a 95% confidence interval. The co-efficient 95% intervals are detailed in Appendix 2.3(g)
Limitations- The model employs a data set spanning a 2-year period, which may be unsuitable for making long-term climate predictions.
Results
Bivariate and Multi-variate Analysis
During the bivariate analysis the categorical variable month and quantitative variables rh_9am and min_temp were all adopted. The day of week had no statistical relationship with the response variable and was rejected.
The variable max_temp was adopted, however it was log transformed in order to fit normal distribution. This shift from skewed to normal distribution is shown in Figure 1.4 and Figure 1.5. The max_temp v evaporation logged relationship is shown in Figure 1.6. A full description of transformation and regression analysis is available in Appendix 2.2.
maxt_distribution
max_temp_log_dist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
maxt_log_plot
## `geom_smooth()` using formula = 'y ~ x'
At the multi-variate analysis level, max_temp_log (0.850) and weekday
(0.217) were excluded due to statistically insignificant p-values
exceeding 0.05. Despite the month variable having a p-value above 0.05,
it was retained in the model due to the presence of multiple individual
significant values within. The interaction term had 2 statistically
significant months (March, October) however when removed, resulted in a
statistically significant improvement in the model. The pr values
dropped from 0.334 in model 3 to 0.0002 in model 4 (model_comparison).
This suggests model improvement in capturing variability. For the added
complexity of keeping and these benefits, the interaction term was
removed from the model.
The model statistical significance improved throughout each iteration. The final p-value is 0.0002188, which is very low. The F-statistic also improved throughout each iteration (3.3452). The RSS value slightly increased with each iteration (1738.7). This suggests some unexplained variance in the model exists.
model_comparison
## Analysis of Variance Table
##
## Model 1: evaporation ~ min_temp + max_temp_log + rh_9am + month + rh_9am:month +
## weekday
## Model 2: evaporation ~ min_temp + rh_9am + month + rh_9am:month + weekday
## Model 3: evaporation ~ min_temp + rh_9am + month + rh_9am:month
## Model 4: evaporation ~ min_temp + rh_9am + month
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 325 1538.2
## 2 326 1538.5 -1 -0.366 0.0774 0.7810204
## 3 332 1570.7 -6 -32.204 1.1341 0.3421197
## 4 343 1741.5 -11 -170.742 3.2796 0.0002799 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When compared to the bivariate analysis, min_temp and rh_9am were as predicted (statistically significant), while weekday was deemed statistically insignificant. The exclusion of max_temp was interesting. While its p-value was significant in the simple bivariate model (max_temp.lm) it was not when a multi-variate analysis was undertaken (evap_v4.lm). This suggests there are correlations being introduced with the addition of multiple predictor variables. The most likely candidate for this change would be min_temp. It is a similar form of measurement with a much higher statistical significance in this data set.
Model Interpretation-
The final model and their 1 unit prediction values (co-efficients) were:
Evaporation rate = β0 + β1 × min_temp + β2 × rh_9am + β3 × month + ϵ
With, β0 being the intercept (8.363) β1 being coefficient for min_temp (0.368) β2 being coefficient for rh_9am (-0.096) β3 being the coefficients for each month ϵ being variability within model
The starting point of this model is determining the evaporation rate when all of our predictors are set to zero((intercept value). In the context of our model, when the temperature is 0 degrees Celsius and relative humidity is 0%, our evaporation rate will be 8.363 mm. (Note for the month coefficient, we just take an average month. A quick look at Figure 1 suggests something like October would be a good candidate!
The next part of our model is the predictor coefficients (min_temp, rh_9am and month). These values indicate how a one-unit change in each influences our evaporation rate (the assumption made here is that the others remain constant). In our model, these 1-unit values are 0.372 for minimum temperature and -0.099 for relative humidity. For our categorical predictors, we select the month of interest. For example, a month in summer (February is -0.593) or winter (June is -1.720). A full breakdown of this and our model assumptions can be found in Appendix 2.3(e). “)
Discussion
The linear model was applied to predict evaporations rates in Melbourne under the following 4 scenarios-
February 29, 2020, if this day has a minimum temperature of 13.8 degrees and reaches a maximum of 23.2 degrees, and has 74% humidity at 9am.
December 25, 2020, if this day has a minimum temperature of 16.4 degrees and reaches a maximum of 31.9 degrees, and has 57% humidity at 9am.
January 13, 2020, if this day has a minimum temperature of 26.5 degrees and reaches a maximum of 44.3 degrees, and has 35% humidity at 9am.
July 6, 2020, if this day has a minimum temperature of 6.8 degrees and reaches a maximum of 10.6 degrees, and has 76% humidity at 9am.
It should be noted maximum temperature values is not a desired predictor for the evap_v4.lm model. Maximum temperature was removed from the model to address introduced correlations. A full discussion on this topic can be found in Appendix 2.3d
The evap_v4.lm model predicts daily evaporation rates in mm to 95% confidence levels. The prediction, upper and lower end confidence levels are provided in Table 1.1.
mwc_results
| Date | Evaporation_mm | Upper_95% | Lower_95% |
|---|---|---|---|
| 06-07-2020 | 2.038338 | 2.929150 | 1.147526 |
| 13-01-2020 | 14.750626 | 15.903770 | 13.597482 |
| 25-12-2020 | 8.276739 | 9.132336 | 7.421143 |
| 29-02-2020 | 5.744770 | 6.609395 | 4.880146 |
Looking at the predicted values (Table 1.1), evaporation rates are 14mm January 13th, 5mm February 29, 2mm July 6 and 8mm December 25. There is a range of 12.7mm across the predicted dates (14.754350-2.038645).
The highest evaporation rates are evident in January, the middle of summer and lowest is in July, during winter. This is as would be expected in the context of the model predictors. For example, during the bi-variate analysis the following relationships were identified- - Higher minimum temperatures results in higher evaporation with an inverse relationship for lower temperatures (Figure 1.7) - Lower humidity results in higher evaporation with the inverse true for higher humidity (Figure 1.9) - January is the month with highest evaporation. July is the second lowest (Figure 1.1)
It can therefore be concluded the time of year, and therefore weather conditions, play a significant role in determination of evaporation rates. From a water resource management perspective, if looking to minimise evaporation losses, the optimal period to focus on would be during the summer period.
mint_plot
## `geom_smooth()` using formula = 'y ~ x'
humid_plot
## `geom_smooth()` using formula = 'y ~ x'
evap_plot
In the second scenario of the study, MWC are interested in comparing predicted results to a 10mm evaporation trigger event at the Cardinia Reservoir. The model concludes- January 13 2020 can be predicted with 95% confidence to trigger this event February 29 2020, July 6 2020 and December 25 can be predicted with 95% confidence to not trigger this event
It is important to note variability is present in the model. Although the data is generally normally distributed in the final model, the QQ plot and Scale Location graphs in Appendix 2.3f reveal outliers, particularly at the upper end of the dataset. The presence of these outliers emphasise the importance of recognising the limitations of the model in capturing this variability. As the particular emphasis of this study is capturing climate change effects more effectively, addressing variability in the upper end of the dataset requires further exploration. Collecting additional data is a potential avenue for reducing this observed variability.
Conclusion
Using the dataset provided by Melbourne Water Corporation (MWC), a robust linear regression model (evap_v4.lm) was developed to predict daily evaporation rates in Melbourne with 95% confidence (Appendix 2.3d). From this analysis, the key predictors required for determination of evaporation rates in the model were found to be minimum temperature, relative humidity and month (Appendix 2.3c).
In regards to addressing MECs key predictions concerning specific dates and environmental conditions for the year 2020, the study produced the following key findings: 1. Daily evaporation rates were 14mm on January 13th, 5mm February 29, 2mm July 6 and 8mm December 25 (Appendix 2.3g); 2. Of these days, January 13th would trigger the MWC daily 10mm evaporation event at Cardinia reservoir (Appendix 2.3h); and 3. Of these days, February 29, July 6 and December 25 would not trigger the MWC daily 10mm evaporation event at Cardinia reservoir (Appendix 2.3h).
These use of evap_v4.lm provides valuable insights into what variables are useful for understanding and predicting daily evaporation patterns. Model predictions provide clear quantitative contributions to the decision making process of the MWC water resource management team.